Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits
Ishii, Masato, Hayakawa, Akio, Shibuya, Takashi, Mitsufuji, Yuki
We introduce a novel pipeline for joint audio-visual editing that enhances the coherence between the edited video and its accompanying audio. Our approach first applies state-of-the-art video editing techniques to produce the target video, then performs audio editing to align with the visual changes. To achieve this, we present a new video-to-audio generation model that conditions on the source audio, target video, and a text prompt. We extend the model architecture to incorporate conditional audio input and propose a data augmentation strategy that improves training efficiency. Furthermore, our model dynamically adjusts the influence of the source audio based on the complexity of the edits, preserving the original audio structure where possible. Experimental results demonstrate that our method outperforms existing approaches in maintaining audio-visual alignment and content integrity.
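The abstract describes dynamically weighting the source audio by the complexity of the video edit. A minimal NumPy sketch of that idea follows; the function names, the frame-distance complexity proxy, and the linear latent blend are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def edit_complexity(src_frames, tgt_frames):
    # Proxy for how much the edit changed the video: mean per-frame
    # L2 distance between source and target, squashed to [0, 1].
    d = np.linalg.norm(src_frames - tgt_frames, axis=-1).mean()
    return float(np.tanh(d))

def condition_on_source_audio(src_audio_latent, generated_latent, complexity):
    # Small edits keep more of the source audio's structure; large
    # edits lean on the freshly generated audio instead.
    w = 1.0 - complexity
    return w * src_audio_latent + (1.0 - w) * generated_latent
```

In this sketch an unchanged video yields complexity 0, so the source audio passes through untouched, matching the stated goal of "preserving the original audio structure where possible".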
ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation
Sun, Jiahui, Wang, Weining, Sun, Mingzhen, Yang, Yirong, Zhu, Xinxin, Liu, Jing
Sounding Video Generation (SVG) remains a challenging task due to the inherent structural misalignment between audio and video, as well as the high computational cost of multimodal data processing. In this paper, we introduce ProAV-DiT, a Projected Latent Diffusion Transformer designed for efficient and synchronized audio-video generation. To address structural inconsistencies, we preprocess raw audio into video-like representations, aligning both the temporal and spatial dimensions between audio and video. At its core, ProAV-DiT adopts a Multi-scale Dual-stream Spatio-Temporal Autoencoder (MDSA), which projects both modalities into a unified latent space using orthogonal decomposition, enabling fine-grained spatiotemporal modeling and semantic alignment. To further enhance temporal coherence and modality-specific fusion, we introduce a multi-scale attention mechanism, which consists of multi-scale temporal self-attention and group cross-modal attention. Furthermore, we stack the 2D latents from MDSA into a unified 3D latent space, which is processed by a spatio-temporal diffusion Transformer. This design efficiently models spatiotemporal dependencies, enabling the generation of high-fidelity synchronized audio-video content while reducing computational overhead. Extensive experiments conducted on standard benchmarks demonstrate that ProAV-DiT outperforms existing methods in both generation quality and computational efficiency.
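Two mechanics in this abstract lend themselves to a small sketch: reshaping raw audio into a video-like representation so both modalities share temporal/spatial axes, and stacking per-modality 2D latents into one 3D latent for the diffusion transformer. The sketch below is a hedged NumPy illustration under assumed shapes; the function names are not from the paper.

```python
import numpy as np

def audio_to_video_like(mel, n_frames):
    # Reshape a (time, mel_bins) spectrogram into n_frames "frames",
    # aligning the audio's temporal axis with the video's.
    t, bins = mel.shape
    assert t % n_frames == 0, "time steps must divide evenly into frames"
    return mel.reshape(n_frames, t // n_frames, bins)

def stack_latents(video_lat, audio_lat):
    # Concatenate per-frame 2D latents from both streams along the
    # token axis into a unified (frames, tokens, channels) 3D latent
    # for a spatio-temporal diffusion transformer to process.
    return np.concatenate([video_lat, audio_lat], axis=1)
```

The point of the unified 3D latent is that one transformer can attend over both modalities' tokens per frame instead of running two independent backbones.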
Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information
Marinoni, Christian, Gramaccioni, Riccardo Fosco, Grassucci, Eleonora, Comminiello, Danilo
The generation of sounding videos has seen significant advancements with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive 360-degree environments. This limitation restricts the creation of audio-visual experiences that are aware of off-camera events. To the best of our knowledge, this is the first work to introduce a framework for controllable audio-visual generation, addressing this unexplored gap. Specifically, we propose a diffusion model conditioned on a set of powerful signals derived from the full 360-degree space: a panoramic saliency map to identify regions of interest, a bounding-box-aware signed distance map to define the target viewpoint, and a descriptive caption of the entire scene. By integrating these controls, our model generates spatially aware viewpoint video and audio that are coherently influenced by the broader, unseen environmental context, introducing a strong controllability that is essential for realistic and immersive audio-visual generation. We show audio-visual examples demonstrating the effectiveness of our framework.
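Of the three conditioning signals named above, the bounding-box-aware signed distance map is the most concrete to illustrate. Below is a minimal NumPy sketch of one standard construction (negative inside the target-viewpoint box, positive outside, measured in pixels); the paper may define it differently, so treat this as an assumption.

```python
import numpy as np

def bbox_signed_distance_map(h, w, x0, y0, x1, y1):
    # Signed distance from every pixel to an axis-aligned bounding
    # box: positive outside the box, negative inside.
    ys, xs = np.mgrid[0:h, 0:w]
    # Distance components for pixels outside the box (clamped at 0).
    dx = np.maximum(np.maximum(x0 - xs, xs - x1), 0)
    dy = np.maximum(np.maximum(y0 - ys, ys - y1), 0)
    outside = np.sqrt(dx ** 2 + dy ** 2)
    # Inside the box, distance to the nearest edge (non-negative).
    inside = np.minimum(np.minimum(xs - x0, x1 - xs),
                        np.minimum(ys - y0, y1 - ys))
    return np.where(outside > 0, outside, -inside)
```

A map like this tells the diffusion model, per pixel, how far the content is from the target viewpoint, which is what lets off-camera events influence the generated view.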
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
Lin, Yan-Bo, Lin, Kevin, Yang, Zhengyuan, Li, Linjie, Wang, Jianfeng, Lin, Chung-Ching, Wang, Xiaofei, Bertasius, Gedas, Wang, Lijuan
In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset to validate its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html
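"Delta denoising" in zero-shot editing typically means applying only the difference between the target-prompt and source-prompt noise predictions, so content untouched by the edit is preserved. A minimal NumPy sketch of one such step follows; this is a generic illustration of the delta idea, not AvED's exact cross-modal formulation.

```python
import numpy as np

def delta_denoise_step(latent, eps_src, eps_tgt, guidance=1.0):
    # Zero-shot delta edit: move the latent only along the *difference*
    # between the noise predicted under the target prompt and under
    # the source prompt. Where the two predictions agree, the latent
    # (and hence the unedited content) is left unchanged.
    return latent - guidance * (eps_tgt - eps_src)
```

In the cross-modal setting described above, the audio and video streams would each take such steps while exchanging information, so the two deltas stay synchronized.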
UniForm: A Unified Diffusion Transformer for Audio-Video Generation
Zhao, Lei, Feng, Linfeng, Ge, Dongxu, Yi, Fangqiu, Zhang, Chi, Zhang, Xiao-Lei, Li, Xuelong
As naturally multimodal content, audible video delivers an immersive sensory experience; consequently, audio-video generation systems have substantial potential. However, existing diffusion-based studies mainly employ relatively independent modules for generating each modality and lack exploration of shared-weight generative modules. This approach may underuse the intrinsic correlations between the audio and visual modalities, potentially resulting in sub-optimal generation quality. To address this, we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality and well-aligned audio-visual pairs. Extensive experiments demonstrate the superior performance of our method in joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks. Our demos are available at https://uniform-t2av.github.io/.
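The key contrast the abstract draws is shared-weight generation over a concatenated latent versus independent per-modality modules. A minimal NumPy sketch of the concatenation step is below; the modality-id channel is a common way to let a shared transformer tell the two token streams apart, but it is an assumption here, not a detail taken from the paper.

```python
import numpy as np

def unify_tokens(video_tokens, audio_tokens):
    # Concatenate (Tv, d) video tokens and (Ta, d) audio tokens along
    # the sequence axis so one shared-weight transformer attends over
    # both modalities jointly. A modality-id vector (0 = video,
    # 1 = audio) lets the model distinguish the streams.
    mod_id = np.concatenate([np.zeros(len(video_tokens)),
                             np.ones(len(audio_tokens))])
    return np.concatenate([video_tokens, audio_tokens], axis=0), mod_id
```

With this layout, audio-guided video generation and video-guided audio generation reduce to fixing one segment of the unified sequence and denoising the other.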
Review for NeurIPS paper: Labelling unlabelled videos from scratch with multi-modal self-supervision
Weaknesses: Required clarifications: there are some parts of the work that would require clarification; see below:
* The description of the exact algorithm is not completely clear to me in the paper (and the appendix). I understand that code is provided, but it should be clarified in the paper. In particular:
  - Is it a pure alternating approach?
  - How many examples are sampled for the clustering stage? Is N equal to the number of examples in the dataset?
  - If I understand correctly, thanks to the probabilistic formulation, once the data is re-clustered there is no need to re-initialize the last linear layer; is that correct? If not, it is unclear to me how to apply the algorithm in an online fashion (see later for a related question).
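For readers unfamiliar with the "pure alternating approach" the reviewer is asking about, the generic pattern alternates between assigning pseudo-labels by clustering and re-fitting the model on those labels. The NumPy sketch below shows one k-means-style alternation purely as background; it is not the paper's algorithm (which uses a probabilistic formulation and would retrain the network's last linear layer rather than re-estimate centroids).

```python
import numpy as np

def alternate_step(features, centroids):
    # One alternation of a generic self-labelling loop:
    # (1) assign each feature a pseudo-label via its nearest centroid,
    # (2) re-estimate centroids from those assignments.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(axis=1)
    new_centroids = np.stack([
        features[labels == k].mean(axis=0) if (labels == k).any()
        else centroids[k]                 # keep empty clusters in place
        for k in range(len(centroids))
    ])
    return labels, new_centroids
```

The reviewer's question is whether the paper's probabilistic formulation lets step (2) reuse the existing classifier head after re-clustering instead of re-initializing it, which would also be what makes an online variant feasible.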